Reverse Experience Replay

1 Overview

Reverse Experience Replay (RER) was proposed by E. Rotinov [1] to handle environments with delayed (sparse) rewards. Because many transitions carry no immediate reward, uniform random sampling is inefficient.

In RER, each batch consists of equally strided transitions, starting from the latest transition. Each subsequent batch is shifted to transitions one step older.

\[ \begin{aligned}
B_1 &= \lbrace T_{t},\ T_{t-\text{stride}},\ \dots,\ T_{t-\text{batch size} \times \text{stride}} \rbrace \\
B_2 &= \lbrace T_{t-1},\ T_{t-\text{stride}-1},\ \dots,\ T_{t-\text{batch size} \times \text{stride}-1} \rbrace \\
&\;\;\vdots
\end{aligned} \]

When the first index of a batch (\(t-i\)) becomes \(2 \times \text{stride}\) steps older than the latest transition, it is reset to the latest transition.
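The sampling rule above can be sketched in a few lines of NumPy. This is only an illustration, not cpprb's implementation: the helper names `reverse_indices` and `next_offset` are hypothetical, the sketch assumes each batch holds exactly `batch_size` indices, and it assumes the buffer is a ring buffer so that indices wrap around modulo `buffer_size`.

```python
import numpy as np

def reverse_indices(latest, offset, batch_size, stride, buffer_size):
    """Indices of one RER batch (hypothetical helper, not part of cpprb).

    `latest` is the index of the newest transition and `offset` is how many
    steps the batch start has already moved back from it.
    """
    start = latest - offset
    idx = start - stride * np.arange(batch_size)  # start, start-stride, ...
    return idx % buffer_size  # assumption: wrap around a ring buffer

def next_offset(offset, stride):
    """Move the batch start one step older; reset once it is 2*stride old."""
    offset += 1
    return 0 if offset >= 2 * stride else offset

first = reverse_indices(latest=100, offset=0, batch_size=4, stride=20, buffer_size=256)
second = reverse_indices(latest=100, offset=1, batch_size=4, stride=20, buffer_size=256)
```

Here `first` starts at the latest transition and `second` contains the transitions one step older, matching \(B_1\) and \(B_2\) in spirit.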

| Parameters | Default | Description   |
|------------|---------|---------------|
| stride     | 300     | Sample stride |

2 Example Usage

ReverseReplayBuffer is used in the same way as an ordinary ReplayBuffer.

import numpy as np
from cpprb import ReverseReplayBuffer

buffer_size = 256
obs_shape = 3
act_dim = 1
stride = 20

rb = ReverseReplayBuffer(buffer_size,
                         env_dict = {"obs": {"shape": obs_shape},
                                     "act": {"shape": act_dim},
                                     "rew": {},
                                     "next_obs": {"shape": obs_shape},
                                     "done": {}},
                         stride = stride)

obs = np.ones(shape=(obs_shape))
act = np.ones(shape=(act_dim))
rew = 0
next_obs = np.ones(shape=(obs_shape))
done = 0

for i in range(500):
    rb.add(obs=obs, act=act, rew=rew, next_obs=next_obs, done=done)

    if done:
        # Together with resetting environment, call ReplayBuffer.on_episode_end()
        rb.on_episode_end()

batch_size = 32
sample = rb.sample(batch_size)
# sample is a dictionary whose keys are 'obs', 'act', 'rew', 'next_obs', and 'done'

3 Notes

The author notes that the stride must not be a multiple of the episode horizon length; otherwise similar transitions are sampled together in the same batch.
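A small sketch can make this concrete. It assumes fixed-length episodes stored back to back, so that a transition's position within its episode is its index modulo the horizon; the helper `steps_within_episode` is hypothetical and not part of cpprb.

```python
import numpy as np

def steps_within_episode(latest, batch_size, stride, horizon):
    """Time step inside its episode for each strided sample (illustrative)."""
    idx = latest - stride * np.arange(batch_size)
    return idx % horizon  # assumption: every episode lasts exactly `horizon` steps

# stride is a multiple of the horizon: every sample sits at the same episode step
bad = steps_within_episode(latest=950, batch_size=4, stride=200, horizon=100)

# stride is not a multiple of the horizon: samples cover different episode steps
good = steps_within_episode(latest=950, batch_size=4, stride=30, horizon=100)
```

Under these assumptions, `bad` holds a single repeated episode step, while `good` spreads across several, which is the situation the note recommends.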

4 Technical Detail

  1. E. Rotinov, “Reverse Experience Replay” (2019), arXiv:1910.08780